Search results for "fail-stop error"


A Generic Approach to Scheduling and Checkpointing Workflows

2018

This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but, unlike fail-stop errors, do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare events, …
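The trade-off mentioned in this abstract can be illustrated with a small sketch. It is not taken from the paper: it assumes a linear chain of tasks on a single processor, exponentially distributed fail-stop failures, and a commonly used closed form for the expected time to complete a segment of work followed by a checkpoint. All function and parameter names are hypothetical.

```python
import math


def expected_time(work, ckpt, recovery, downtime, rate):
    """Expected time to run `work` seconds plus a `ckpt`-second checkpoint.

    Assumes exponentially distributed fail-stop failures with the given
    `rate` (failures per second); after each failure the processor is down
    for `downtime` seconds and then needs `recovery` seconds to reload the
    last checkpoint. This closed form is an assumption of the sketch.
    """
    return (
        math.exp(rate * recovery)
        * (1.0 / rate + downtime)
        * math.expm1(rate * (work + ckpt))
    )


def chain_checkpoint_all(weights, ckpt, recovery, downtime, rate):
    """Checkpoint after every task: each task is its own segment."""
    return sum(
        expected_time(w, ckpt, recovery, downtime, rate) for w in weights
    )


def chain_checkpoint_none(weights, downtime, rate):
    """Never checkpoint: one long segment restarted from scratch on failure.

    Simplification: restarting from scratch is assumed to cost no recovery.
    """
    return expected_time(sum(weights), 0.0, 0.0, downtime, rate)


if __name__ == "__main__":
    tasks = [1000.0] * 10  # ten tasks of 1000 s each (illustrative values)
    for rate in (1e-7, 1e-3):
        every = chain_checkpoint_all(tasks, 60.0, 60.0, 0.0, rate)
        never = chain_checkpoint_none(tasks, 0.0, rate)
        print(f"rate={rate:g}: checkpoint all={every:.0f} s, none={never:.0f} s")
```

Under these assumptions, checkpointing every task costs more than never checkpointing when failures are rare, and far less when they are frequent, which is the tension the abstract describes.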

Computer science; workflow; Distributed computing; 02 engineering and technology; Theoretical Computer Science; Scheduling (computing); resilience; checkpoint; fail-stop error; 0202 electrical engineering electronic engineering information engineering; Rare events; Overhead (computing); [INFO] Computer Science [cs]; Resilience (network); Complement (set theory); 020203 distributed computing; 020206 networking & telecommunications; 020202 computer hardware & architecture; [INFO.INFO-PF] Computer Science [cs]/Performance [cs.PF]; Task (computing); Workflow; Hardware and Architecture; fatal error; [INFO.INFO-DC] Computer Science [cs]/Distributed Parallel and Cluster Computing [cs.DC]; Heuristics; Software


Checkpointing Workflows for Fail-Stop Errors

2017

We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge,…
